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AN ADAPTIVE INDEPENDENCE SAMPLER MCMC ALGORITHM 
FOR INFINITE-DIMENSIONAL BAYESIAN INFERENCES* 

ZHE FENGt AND JINGLAI LI* 


Abstract. Many scientific and engineering problems require to perform Bayesian inferences in 
function spaces, in which the unknowns are of infinite dimension. In such problems, many standard 
Markov Chain Monte Carlo (MCMC) algorithms become arbitrary slow under the mesh refinement, 
which is referred to as being dimension dependent. In this work we develop an independence sampler 
based MCMC method for the infinite dimensional Bayesian inferences. We represent the proposal 
distribution as a mixture of a finite number of specially parametrized Gaussian measures. We show 
that under the chosen parametrization, the resulting MCMC algorithm is dimension independent. 
We also design an efficient adaptive algorithm to adjust the parameter values of the mixtures from 
the previous samples. Finally we provide numerical examples to demonstrate the efficiency and 
robustness of the proposed method, even for problems with multimodal posterior distributions. 

Key words, adaptive Markov chain Monte Carlo, Bayesian inference, Gaussian mixture, inde¬ 
pendence sampler, inverse problem 


1. Introduction. Nonparametric Bayesian inferences have applications in many 
scientific problems, ranging from regression m to inverse problems [niisi]. In those 
problems the unknown that we want to infer is of inhnite-dimension, for example, a 
function of space or time. In many practical problems, the posterior distributions do 
not admit a closed form and need to be computed numerically. Specifically one first 
represents the unknown function with a finite-dimensional parametrization, for exam¬ 
ple, by discretizing the function on a pre-determined mesh grid, and then solve the 
resulting finite dimensional inference problem with the Markov Chain Monte Carlo 
(MCMC) simulations. It has been known that standard MCMC algorithms, such as 
the random walk Metropolis-Hastings (RWMH), can become arbitrarily slow as the 
discretization mesh of the unknown is refined |311l33L [Sll26|. That is, the mixing time of 
an algorithm can increase to infinity as the dimension of the discretized parameter ap¬ 
proaches to infinity, and in this case the algorithm is said to be dimension-dependent. 
To this end, a very interesting line of research is to develop dimension-independent 
MCMC algorithms by requiring the algorithms to be well-defined in the function 
spaces. In particular, a family of dimension-independent MCMC algorithms were 
presented in [5] by constructing a preconditioned Crank-Nicolson (pCN) discretiza¬ 
tion of a stochastic partial differential equation (SPDE) that preserves the reference 
measure. 

Just like its finite dimensional counterparts, the sampling efhciency of the infinite 
dimensional MCMC can be improved by incorporating the data information in the 
proposal design. One way of doing so is to guide the proposal with the local deriva¬ 
tive information of the likelihood function. Methods in this category include: the 
stochastic Newton MCMC [351 [55]; the operator-weighted proposal method [30], the 
infinite-dimensional Metropolis-adjusted Langevin algorithm (MALA) [71[S], and the 
dimension-independent likelihood-informed (DILI) MCMC [5], just to name a few. An 
alternative type of methods to improve the efhciency with the data information is the 
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adaptive MCMC (c.f. [T1I2I32] and the references therein), which automatically ad¬ 
justs the proposal as the algorithm proceeds. While the first type of approaches utilize 
the gradient or the Hessian of the likelihood function to accelerate the computation, 
the adaptive methods do not require such information, which makes it particularly 
convenient for problems with black-box models. 

In this paper we propose an adaptive MCMC algorithm with independence sam¬ 
pler (IS) |35] for infinite dimensional problems. IS, also known as the independent 
Metropolis-Hastings (MH) [TB], or the Metropolized independent sampling [53], is an 
alternative to the popular RWMH algorithm, which proposes from a stationary dis¬ 
tribution, i.e., one that is independent of the present position. The design principle 
for the independence sampler method is rather straightforward: loosely speaking, one 
should choose the proposal distribution to be as close to the target distribution as 
possible. The basic idea here is to represent the proposal distribution with a mixture 
of a finite number of parametrized Gaussian measures, and optimize the parameters 
as the algorithm proceeds. Our specific parametrization ensures the algorithm to be 
well defined in function spaces and therefore dimension independent. As is mentioned 
earlier, a major advantage of the proposed method is that it can propose efficiently 
without using the derivative information of the likelihood function. Moreover as is 
demonstrated by our numerical examples in Section 5, our method performs well for 
multimodal posterior distributions which can be challenging for many existing algo¬ 
rithms. 

The rest of the paper is organized as the following. In section we introduce 
the basic setup of the infinite dimensional Bayesian inference problem. In section [^ 
we present the Gaussian mixture based independence sampler and show that it is 
well-defined in the function space. Section 4 is devoted to a detailed description of 
the complete algorithm and and section 5 provides several numerical examples of the 
proposed method. 

2. Problem setup. We consider a separable Hilbert space X with inner product 
(•, ■)x- Our goal is to estimate the unknown u G X from data y GY where Y is the 
data space and y is related to u via the likelihood function 

L{u,y) = ^exp(-$^(M)), 

where Z is a normalization constant. In what follows, without causing any ambiguity, 
we shall drop the superscript y in for simplicity. In this work we require the 
functional $ satisfies the Assumptions (6.1) in [8|, i.e., 

(a) there exists g > 0, Q > 0 such that, for all u G X, 

0<Hu)<Q{l + \\urx); 

(b) for every r > 0 there is > 0 such that, for all u, v G X with 

max{||M||x, lli'llx:} < r. 


|$(m) - $(u)| < Qr\\u - wIIjc. 

We do not have any restrictions on the space Y. 

In the Bayesian inference we assume that the prior /tq of u, is a (without loss of 
generality) zero-mean Gaussian measure defined on X with covariance operator Co, 
i.e. /To = N{0,Cq). Note that Co is symmetric positive and of trace class. The range 
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of C, 


0 I 


E = {u = C^x\xG X} C X, 


which is a Hilbert space equipped with inner product cni, 


is called the Cameron-Martin space of measure /tq. In this setting, the posterior 
measure fE of u conditional on data y is provided by the Radon-Nikodym derivative: 


dyE, . 


1 

z 


exp(-$(M)), 


( 2 . 1 ) 


which can be interpreted as the Bayes’ rule in the infinite dimensional setting. Our 
goal is to draw samples from the posterior with MCMC algorithms. 

Note that the definition of the maximum a posteriori (MAP) estimator in finite 
dimensional spaces does not apply here, as the measures and /tq are not absolutely 
continuously with respect to the Lebesgue measure; instead, the MAP estimator in 
X is defined as the minimizer of the Onsager-Machlup functional (OMF) [TTl I21j : 


/(u):=$(u) + ^||u|||, (2.2) 

over the Cameron-Martin space E. In Section we shall use OMF as an indicating 
quantity to compare the performance of various MCMC algorithms. Finally we quote 
the following lemma (sni, Chapter 1), which will be useful in next section: 

Lemma 2.1. There exists a complete orthonormal basis {efcjfegN on X and a se¬ 
quence of non-negative numbers {ofcjfcgN such that CoCk = cxkCk and < oo, 

i.e., {efejfcgN and being the eigenfunctions and eigenvalues of Cq respec¬ 

tively. 

Without loss of generality, we assume that the eigenvalues are in a de¬ 

scending order. 


3. Gaussian mixture based independence sampler. In this section, we 
present our Gaussian mixture based independence sampler and show that it is well- 
defined in the function space. 


3.1. Independence sampler MCMC. We start by briefly reviewing the in¬ 
dependence sampler MCMC algorithm. Given a proposal distribution /x, we define 
measures 


v{du,du') = fi{du')pL^ (du), 
v\du,du') = p,{du)p}' {du'), 


on the product space X x X. When is absolute continuous with respect to we 
can define the acceptance probability [36] 


where 


dv 


di'K 

(3.1) 


(3.2) 

as follows in each iteration: 
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1. Draw a sample Uproposed from the proposal /r. 

2. Let Un0xt — I^proposed with probability ^(llcurrent; I^proposed) d-Ild 
^next — Ilcurrent wlth probability 1 ^(rtcurrent; ^proposed) ■ 

We reinstate that the function space IS algorithm Is well-defined if and only 
if is absolutely continuous with respect to v, which requires that /r and are 
equivalent to each other. Since /r*' and are equivalent, it suffices to require /i and 
fjLQ to be equivalent. Interestingly, the pCN scheme with a specific choice of parameter 
values yields a dimension-independent IS whose proposal distribution is simply the 
prior. Despite its dimension-independence property, simply proposing according to 
the prior is inefficient when the data is highly informative, i.e., the posterior being far 
apart from the prior. Next we shall introduce a more efficient proposal measure than 
the prior that is to be used in IS MCMC algorithms. 

3.2. Gaussian mixture proposals. In finite dimensional Bayesian inference 
problems, Gaussian mixture (GM) distributions [57] are often used as the IS proposal 
distributions for their flexibility and convenience to draw samples from. We now 
extend the use of GM to the infinite dimensional setting. Let be a set of 

Gaussian measures on X with /i^ = N(rnj,Cj) for j = 1...J, and we define the 
Gaussian mixture proposal as 


,7 


^{dx) = Wj^j{dx) 

i=i 

(3.3) 

where {wj}j^i are the mixing weights with = 1- 

is equivalent to ^0 as long as each is equivalent to /xq, 

Nikodym derivative of /x to no is 

It should be clear that /x 
and moreover the Radon- 

dl^ , s, d^j 

(3.4) 

Next we discuss our parametrization of each /x^. First recall that, according to 
Lemma 2.1 {efcjfcGN form a complete basis set of X. Our parametrization of 
is in the form of: 

00 

Tflj — ^ ^ ^j,kC^k^k ; 

(3.5a) 

C-^ = C^^+H, 

(3.5b) 


where each Hj is defined as 


Hj * — ^ ^ ‘)o 7 ; (3.5c) 

k=l 

and Xj^k and hj^k are coefficients. The following Theorem provides a sufficient condi¬ 
tion for fXj = XXrrij, Cj) to be a well defined Gaussian measure on X and equivalent 
to ^ 0 - 

Theorem 3.1. IfXj^hj G I2, and fox all k gN, fij = XJjUj^Cj) is 

a Gaussian measure on X that is equivalent to 
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Proof. We let {f3j^k}kGN be the eigenvalues of Cj, i.e, CjCk = (3j,kek for all /c C N. 
And it is easy to see that, 


j^j.k — fo ^j.k') — 


Ctk 


1 + Ukhj^k 


(3.6) 


As Xj,hj € h, i+o!uh k bounded and thus J2T=il^3,k < oo. It follows that Cj G 
L^(X) and /j,j = Offnij^Cj) defines a Gaussian measure on X. 

We now show that ^Mj is equivalent to /xq. First we introduce /x' = 0\i{Q,Cj). 
Using Eq. (3.61 and hj^k > /c S N, we can get 


E 


(/?i.fc - akf _ ^ 


= E 


^ (2 + Oikhj^k) 


< E 


fc = l 


as limfc-j-oo = 0 and hj G l 2 - By the Feldman-Hajek theorem m, we have /x' 

1 1 

is equivalent to /xg. Now recall that mj G E = Cq {X) = Cj{X), and so we have 
Mj- = A)(0, Cj) and /x^- = Xi[mj,Cj) are equivalent, which completes the proof. □ 

Let us assume for now that the conditions in Theorem l3.1l is satisfied and we shall 
verify this assumption later. It is easy to show that 


d^j 

d/Xg 


{u) 


j^°ji/2 exp(-^l|Cj + {u,Cj ^mj)x - ^{u,Hju)x) 


n 


/ ak 

.. OO 

^ V 

'' 2 , X . , „,2 

2afc \ 

V ^3,k 

2 ^ 


Xj^f^Uf^ 1 

Pj.k J J 


(3.7) 


where Uk = {u,ek) is the projection of u onto e^,. Note that the density d^j/dfko 
actually depends on rrij and hj, and thus for convenience’s sake, we define a function 
/(•, •, •) such that 


f{u,Xj,hj) = ^(u), 


and we then can derive from Eq (3.4) that 
d^E 


1 


d^ 


(u) = -exp(-d>(u))/(y]u>j/(?x,xj,/ij)). 




and the density d^/d^y can be computed accordingly. 

3.3. Minimizing the Kullback-Leibler divergence. Now recall that for the 
algorithm to be efficient we need the proposal ^ to be close to and a natural choice 
is to determine /x by minimizing the Kullback-Leibler divergence (KLD) between ijfl 
and 


DKLih^Wh) = 


^og^iu)fi^{du), 

dfi 


(3.8) 


where /x is parametrized with Eq. (3.5). Note that Xj and hj are set to be of infinite 


dimensions in the formulation above. In numerical simulations, however, Xj and hj 
must be truncated at some finite number K. Such a truncation is also reasonable 
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from a practical point of view. In fact, one often can realistically assume that the 
data is only informative on a finite number of directions BE in X, and under this 
assumption, we only need to keep a finite number of components of each xj and hj. 
We emphasize that K which represents the number of dimensions that are informed 
by the data (i.e., the so-called intrinsic dimensionality), should not be confused with 
the discretization dimensionality of the problem, i.e., the number of mesh points used 
to represent the unknown. Determining the value of K is an important task for our 
algorithm and here we choose K with a heuristic approach: 

K = min{/c G N | — < e}, 

ai 

where e is a prescribed threshold. In what follows, we shall adopt this finite, K- 
dimensional formulation, and thus we have the following optimization problem: 


mm 

{xj, hj , tOj G [0,1]} 




(3.9) 


subject to J2j=i '^3 = 1- By some elementary calculations, we can show that Eq. (3.91 
is equivalent to 


mm 

{xj , hj , Wj G [0 


..u-ht 


Wjf{u,Xj,hj)]^iy{du), 


(3.10) 


subject to '^3 — 1- We now show that the proposal /r constructed this way is 
well-defined in function space, and to this end we have the following corollary. 

Corollary 3.2. If {xj,hj,Wj}j^i is a solution of Eq. (3.101 , the resulting ij. is 
equivalent to qiQ. 


Proof. It is obvious that if {xj,hj,Wj}j^i is a solution of Eq. (3.10), Xj,hj G h- 
Taking partial derivative of the objective function in Eq. (3.10) with respect to hj^k 
and setting it to be zero yields the following equation: 


Wjf{u,Xj,hj) 


-dyf 




As the following two integrals are obviously positive: 


Wjf{u,Xj,hj){akXj^k - 
T.i=iWif{u,xi,hi) 


Wjf(u,Xj,hj) 

Y.i=iWifiu,Xi,hi) 


d^y > 0, and 


Wjf{u,Xj, hj){akXj^k - Ukf 
Y.i=iWif{u,xuhi) 


)dfj.y > 0, 


we have 1 -I- athj^k > 0. Thus all the conditions of Theorem |3.1| are satisfied and the 
corollary follows immediately. □ 

Finally we note that, in the special case where J=l, namely the proposal be¬ 
ing simply a Gaussian distribution, our parametrization is similar to the finite rank 
representation used in [23 [3D]. In fact, the aforementioned works also proposed to ap¬ 
proximate the posterior with a Gaussian distribution by minimizing the KLD between 
the two distributions. The major difference is the KLD (recall that it is asymmetric) 
formulation: the authors of [13130] compute the divergence from the Gaussian ap¬ 
proximation to the true posterior, while here we compute the divergence the other way 
around. An advantage of the present formulation is that the solution to Eq. (3.10) 
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can be explicitly obtained: 


hk 



f {akXk - UkYd^iy Uk ’ 


(3.11a) 

(3.11b) 


for k = while in their formulation the resulting optimization problem has to be 


solved with a stochastic optimization algorithm. The explicit solutions (3.111 are of 
essential importance in our adaptive algorithm. 

4. The adaptive algorithm. In this section we discuss the algorithm to im¬ 
plement the IS method proposed in Section starting with an introduction to the 
adaptive MCMC. 


4.1. Adaptive MCMC. The basic idea of the adaptive MCMC is to repeatedly 
adjust the proposal parameters using the information in the previous samples. Here 
we are focused on the adaptive algorithms with IS dn na ng [i2], while noting 
that other types of adaptive algorithms include the adaptive MH the adaptive 
MALA [ai21j, and the adaptive Metropolis-within-Gibbs (MwG) [32]. Specifically 
our adaptive algorithm has the following three key ingredients. First, to enforce 
the asymptotic ergodicity, we terminate the adaptation in a finite number of steps. 
Secondly we use a tempered pre-run to obtain the initial parameter values for the 
iteration. Simply speaking the technique of tempering is to construct a sequence 
of intermediate distributions that converge to the true posterior and use these 
intermediate distributions to guide the MCMC samples to the true posterior. This 
strategy is particularly useful for multimodal posterior distributions. Without loss of 
generality, we assume that the tempering distributions are augmented by a tempering 
parameter A: 


—-cx exp(—A4>(u)), 

dfJ-o 

and clearly when A = 1 and the tempering distribution is “wider” than the 

true posterior for 0 < A < 1. In practice we can choose a finite number of tempering 
parameters where 0 < Ai < A 2 < ... < = I. We also note that for 

problems where the posterior is not too far apart from the prior, tempering may not 
be necessary. Finally we estimate and update the proposal parameters after every 
fixed number of iterations. The adaptive scheme is summarized as the following: 

• Initialization: the total number of iterations /toi, the number of adapted 

iterations /adp, the number of pre-run (tempering) iterations /temp, a set of 
tempering parameters { , the number of samples used in each tempered 

iteration Wemp, the number of samples in each iteration Ns- 

• Pre-run (optional): let = fio; for i = 1 : /temp perform: 

1. Run MCMC with proposal M(i-i) to draw a set of Atemp samples from 

denoted by S'. 

2. Update the parameter values with samples S'' obtaining proposal 

• Iteration: let S = 0 and ^( 0 ) = )( for i= 1 to /toi perform: 

1. Run MCMC with proposal M(i-i) to draw a set of Ns samples from /x*', 
denoted by Si. Let S = S iJ Si. 
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2. If J < /adp, update the parameter values with samples S obtaining pro¬ 
posal otherwise, let 

The adaptive algorithm presented above is rather simple; we note, however, that 
our method is rather flexible and one can pair it with any desired adaptive IS algo¬ 
rithm. A key step in the adaptive algorithm is to estimate the parameters from the 
samples, which is done by solving the sample average estimator of the optimization 
problem (|3.10|): 


max 


N J 

log[^ Wi/(u”, 

n=l j=l 


(4.1) 


subject to X])=i = 1- Next we discuss two methods to solve Eq. (4.11. 


4.2. Expectation Maximization algorithm. The expectation maximization 
(EM) is one of the most popular methods to determine the parameters in mixture 
models [27]. Simply put, the EM algorithm iteratively updates the parameter values 
in a way that the function value is always increased until convergence is achieved. 
Each iteration consists of an Expectation-step and a Maximization-step. It should be 
noted that, the EM algorithm, is not guaranteed to converge to the optimal solutions 
in general m- The theory and implementation details of the EM algorithm and its 
application to mixture models can be found in the aforementioned references , and 
we shall not repeat them here. When applied to our problem, the update formula 
in each iteration can be explicitly obtained. In the Expectation-Step, the member¬ 
ship probability q^, namely the probability that a sample w" is in the mixture j, is 
computed. 


^ u.,nu\h,.,n,) ^ 

e1=i 

for each j = 1...J and n = in the Maximization-Step, the parameter values are 

updated using the following equations: 


N 


= -y 

N ^ 


2 = 1 


N 


Xj k = qluk 

N N 


= Y ^7CY <i7('^kXj,k -uif) 1 , 

—1 —1 


n—1 71—1 


(4.3a) 


(4.3b) 


(4.3c) 


where u'^ = (u", e^). The EM algorithm is arguably the most common method to esti¬ 
mate the parameters of mixtures. However, our numerical tests indicate that in some 
practical problems the EM algorithm is not sufficiently reliable especially when the 
sample set only contains a small number of accepted draws. Moreover, our algorithm 
frequently updates the proposal parameters, which makes the computationally inten¬ 
sive EM algorithms less attractive from an efficiency perspective. For these reasons, 
we propose an alternative method to EM, which estimates the mixture parameters 
using clustering. 
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4.3. Estimating parameters with clustering. Our estimation method with 
clustering is largely based on the finite dimensional method developed in |13j . The 
idea is rather simple: one first partitions the samples into several clusters and then fit 
each cluster with a Gaussian distribution. A difficulty here is that our MCMC samples 
are of infinite dimension, which makes clustering challenging. To solve the problem, 
we first project the samples onto the K eigenfunctions of the covariance operator and 
then cluster the resulting K dimensional data {(«!, ••■, = (u",efc). 

Specifically we use the k-means algorithm to cluster the data, and the number of 
clusters J is determined with the Bayesian information criteria (BIC) method [57] • In 
fact we have found in our numerical tests that the algorithm is rather robust against 
the number of clusters. We then use the Gaussian distribution parametrized in the 
form of Eq (3.5) to fit each cluster, and thanks to Eq. (3.11), the parameters values 
can be estimated explicitly as. 






NjUk 








1 

Oik ’ 


(4.4a) 


(4.4b) 


where is the j-th cluster of samples, Nj is the sample size of 0j, for j = 1... J and 
k — The mixture weights are simply determined by the fraction of samples 

in each cluster. We note that the clustering based method does not generally yield 
a solution to Eq. (4.1) and thus we regard it as an approximate method to estimate 
the parameters. We conclude the section with a pseudo code (Algorithm of our 
algorithm, and interested readers can use it as a basis for their own implementation. 


5. Numerical examples. 

5.1. An ordinary differential equation example. Our first example is a 
simple inverse problem where the forward model is governed by an ordinary differential 
equation (ODE): 


= -u{t)x(t) (5.1) 

with a prescribed initial condition. We assume that the solution x{t) is observed at 
several times in the interval [0, T] and we want to infer the unknown coefficient u{t) 
for t G [0,T]. 

In our numerical experiments, we let the initial condition be 77 ( 0 ) = 1 and T = 1. 
Now suppose that the solution is measured every T/20 time unit from 0 to T and 
the error in each measurement is assumed to be an independent zero-mean Gaussian 
random variable with variance 0.05^. In the computation, 100 equally spaced grid 
points are used to represent the unknown. Moreover, we assume that the state space 
for u is X = L2([0,T]) and the prior is a zero-mean Gaussian measure in X with an 
exponential covariance function: 


=exp(-^^^). 


(5.2) 


The true coefficient u(t) is a realization from the prior (shown in Eig. 5.1) and the 
data is simulated accordingly. 
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input : /temp, ^temp, Ntol, iVmax, N^dp- 

output: TVtoi samples drawn from fj}': 

^ <r- ^q; 

for i ^ 1 to /temp do 
draw ^ /r; 
for n ^ 1 to A^temp do 
draw u' ^ fj,] 

draw a ^ 17[0,1] and compute 

/I . • n f '\ 

A ^ mm{l, —- (u )- 


, ,n—1 


dn d^y^- 


)}; 


if A > a then u^‘ 


u'] else 


end 


cluster {m°, ..., into J subsets: ; 

for j ^ 1 to J do 

Nj ^ sample size of Qj, Wj ^ Nj/Nte^p-, 
compute parameters Xj and hj using Eqs. (4.4); 
compute using Eq. (3.7); 
end 

end 

draw vP ^ 

for n •(— 1 to do 

draw u' ^ 

draw a ^ t/[0,1] and compute 


if A > a then u" ^ u'; else rt" ^ 
if (n < //max)&(fi mod A^adp = 0) then 
cluster {u°,..., u"} into J subsets: 0i,... 
for j ^ 1 to J do 

Nj •(— sample size of 0^, Wj -(r- Nj/n; 


0 


J ; 


end 


compute parameters Xj and hj using Eqs. (4.4); 
compute fij using Eq. (3.7); 


h ^ Ej=i Wj^j; 

end 

end 

Algorithm 1: The complete algorithm for the adaptive IS with GM. /temp is the 
number of tempered iterations. tempering parameters. A^temp is 

the number of samples used in each tempered iteration. Atoi is the total number 
of samples drawn by the algorithm. Aadp is the number of samples drawn between 
two consecutive parameter updates. Amax is the maximum length of chain before 
the adaptation is terminated. 
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Fig. 5.1. (for example 1) The posterior mean computed with the four different MCMC schemes. 
The truth is also plotted for comparison. 


We now draw samples from the posterior of u{t) with four different MCMC 
schemes: prior-based IS, adaptive IS with Gaussian approximation, adaptive IS with 
Gaussian mixtures, and the random walk pGN (RW-pGN). In each MCMC scheme, 
3 X 10^ draws are generated. In the prior based IS, one simply proposes according to 
the prior distribution, and no adaptation is used. In the adaptive IS with Gaussian 
approximation, the proposal is restricted to be a single Gaussian (i.e. J = 1), and in 
this case clustering is not needed. In both of the adaptive IS methods, the parameters 
are updated after every 1000 draws, and the parameter adaptation is terminated in 
the last 10® iterations. We do not use tempering in this example. The RW-pCN 
algorithm used in this work iterates as follows 

1. propose Uproposed = \/ 1 - /S^Mcurrent + Pw , where W fJ-Q 

2. Let Unext = ^proposed with probability 

^ — min| 1, exp($(Upj.oposed) ^(^current))}; 


and let Unext = ^current with probability 1 — a. 

In this example we use /? = 0.1. Note that, in all the numerical examples, we choose 
the stepsize /3 so that the resulting acceptance probability is in the range 20% — 30%, 
as is recommended in [HH] . 

In Fig. |5.1[ we show the posterior mean computed by the four MCMC schemes, 
while the truth is also shown for comparison purpose. One can see that the results 
of the four algorithms are nearly identical, suggesting that all the algorithms can 
estimate the posterior mean to a similar level of accuracy. We then use the OMF as 
an indicative parameter and show the trace plots of it in Fig. 5.2 We see from the 


plots that the two adaptive IS algorithms achieve much faster mixing rate than the 
other two methods. To further compare the efficiency of the methods, we compute 
the autocorrelation functions (ACF) of various quantities with the samples drawn by 
the four methods, and plot the ACF results in Fig. |5.3| In particular, we plot the 


ACF of the OMF as a function of lag in Fig. 5.3 (left) and show the lag 1 ACF for the 
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Fig. 5.2. (for example 1) The trace plots of the OMF for the four different MCMC schemes. 


unknown u at each grid point in in Fig. 5.3 (right). It can be seen from the figure that, 
our adaptive algorithms with single Gaussian proposal and with mixtures both result 
in much lower ACF values than the other two methods. When comparing the two 
adaptive algorithms, the mixture proposal outperforms the single Gaussian. For the 
IS algorithms, the acceptance probability is also a useful performance indicator, where 
higher acceptance rates are usually preferred, while it is not the case for random walk 
algorithms [35]. In Fig|5.4| (left) we plot the acceptance probability as a function of 
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Fig. 5.3. (for example 1) Autocorrelation functions (ACF) for the four different MCMC meth¬ 
ods. Left: ACF of the OMF plotted as a function of lags. Right: the lag 1 ACF for u at each grid 
point. 



Iterations (x 10^) 

Fig. 5.4. (for example 1) Left: the 
ESS at each grid point. 



rate of the four MCMC schemes. Right: the 


iterations for all the methods. For the three IS algorithms, one can see that the two 
adaptive algorithms have significantly higher acceptance probability than the prior 
based method. Meanwhile, the acceptance probability of IS with mixtures is higher 
than that of the one with the single Gaussian. The effective sample size (ESS) is 
another common measure of the sampling efficiency of MCMC [15] . ESS is computed 
by 


ESS 


N 

1 + 2r’ 


where r is the integrated autocorrelation time and N is the total sample size, and it 
gives an estimate of the number of effectively independent draws in the chain. We 
computed the ESS of the unknown u at each grid point and show the results in Fig. |5.4| 
(right). Once again, the plots indicate that the adaptive algorithms produce much 
more effectively independent samples than the prior based IS and the RW-pCN, while 
the mixture proposal outperforms the single Gaussian one in most of the dimensions. 
In summary, in this simple nonlinear inverse problem, we show that our adaptive 
algorithms are significantly more efficient than the prior based IS and the RW-pCN. 
Meanwhile, the mixture proposal outperforms the single Gaussian one, indicating that 
the more flexible mixture representation does improve the efficiency. 
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Fig. 5.5. (for example 2) 100 samples randomly selected from the chain drawn by each method. 


5.2. A bimodal likelihood function example. Our second example is an 
artificially constructed bimodal problem. Once again we assume the unknown u G 
X = L'^{[0, 1]) and the prior is a zero mean Gaussian measure with the same covariance 
function Eq. (5.21 as the first example. We consider a bimodal likelihood function, 
given by, 


1 


1 , 


exp(—<i>(u)) cx exp(—-||u — sin(27rt)|j2) + exp(—-||it + sin(27rt)||2) 


and it can be verified that the $(•) chosen this way satisfies the Assumptions (6.1) 
in [5]. It is easy to see that the posterior distribution should have two modes: one is 
close to sin(27rt) and the other is close to — sin(27rt). 

We draw samples from the posterior of u{t) with the same four MCMC schemes 
used in the Hrst example, and in each MCMC scheme, 5 x 10® draws are generated. In 
both of the adaptive IS methods, the parameters are updated after every 1000 draws, 
and the adaptation is terminated in the last 10® iterations, with no tempering used. 
In the RW-pCN, we choose (3 = 0.5. In all the computations, 100 grid points are used 
to represent the unknown function u. 

As has been mentioned, the posterior distribution has two modes and we shall 
exam if the algorithms can capture both of them. In this respect, we randomly select 
100 samples from the chain generated by each algorithm and plot them in Fig. |5.5[ 
We can see that the results of each algorithm can capture the two models of the 
posterior. Next we shall compare the efficiency of the four algorithms. As before, 
we first show the trace plots of the OMF for the four algorithms in Fig |5.6| and one 
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Fig. 5.6. (for example 2) The trace plots of the OMF for the four different MCMC schemes. 


can see that the results of the two adaptive methods and pCN all obtain fairly good 
mixing results, while the prior based IS seems to have a much slower mixing rate than 
the other three. Fig. 5.7 (left) plots the ACF of the OMF as a function of lag and 
Fig. 5.7 (right) shows the lag 1 ACF for the unknown at each grid point. Both figures 
indicate that the adaptive IS with mixtures has the best performance in terms of 
ACF values. Fig. 5.8 (left) plots the acceptance rate against the number of iterations, 
which shows that the three IS algorithms perform very differently: the prior based 
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Fig. 5.7. (for example 2) Autocorrelation functions (ACF) for the four different MCMC meth¬ 
ods. Left: ACF of the OMF plotted as a function of lags. Right: the lag 1 ACF for u at each grid 
point. 




Fig. 5.8. (for example 2) Left: the acceptance rate of the four MCMC schemes. Right: the 
ESS at each grid point. 


The mean value of Different proposal 




Fig. 5.9. (for example 2) Left: the sample mean of the mixture-based IS method (solid) and 
that of the pCN method (dashed). Right: samples drawn by the mixture IS method and by the pCN 
method. 
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IS results in an acceptance rate less than 1%, the adaptive IS with one Gaussian 
results in a rate up to 17%, and that of the adaptive IS with mixtures rises to around 
80% as the iteration proceeds. We compute the ESS of each dimension and show the 


results in Fig. 5.8 (right), and we see that the ESS of the adaptive IS with mixtures is 


significantly higher than that of the other three methods, indicating that the adaptive 
IS with mixtures has a substantial advantage in this multimodal problem. 

Finally to understand the limitation of the proposed method, we test it on another 
bimodal likelihood function: 


1 


1 , 


exp(—$(u)) oc exp(—-||u — 2sin(27rt)||2) + exp(—-||u + 2sin(27rt)||2). 


We drew 5 x 10^ samples with the mixture based IS algorithm and with the pCN. We 
plot the mean of the samples drawn by both method in 5.9 (left), and in 5.9 (right), 
we plot 100 samples drawn by each algorithms. It can be seen from the figures that, 
both methods can only capture one mode of the posterior distribution, indicating 
that, the problem becomes challenging for our method and the pCN when the modes 
of the target distribution are far apart. 


5.3. Inverse heat conduction under model uncertainty. Our last example 
is the inverse heat conduction (IHC) problems, which consist of estimating tempera¬ 
ture or heat flux density on an inaccessible boundary from a measured temperature 
history inside a solid. These problems have been studied over several decades due 
to their importance in a variety of scientific and engineering applications |3]. The 
IHC problems become nonlinear if the thermal properties are temperature depen¬ 
dent, where the inversion is significantly more difficult than the linear ones. In this 
example we consider a one-dimensional heat conduction equation 


du 

Ih 


A 

dx 



(5.3) 


with initial u{x, 0) = Uo(x). Here x and t are the spatial and temporal variable, u{x, t) 
is the temperature, and c{u) is the temperature dependent thermal conductivity, and 
the length of the medium is L, all in dimensionless units. We now assume that a heat 
flux is injected through the left boundary {x = 0), yielding a Neumann boundary 
condition: 

Au(o,t) = q(t). 

The boundary condition (BC) at a: = L is subject to uncertainty: with probability 
0.8 it is 

■^u{L,t) = 0, (5.4a) 

and with probability 0.2 it is 

d 

—u(L,t) = —u. (5.4b) 

ox 

The interpretation is that, the system has two possible states: one with a perfectly 
insulted boundary at a; = L, and the other has heat diffusion at x = L. 

Suppose that we place a temperature sensor in the medium {x = Xg) and the goal 
is to infer the heat flux q(t) for t S [0, T], from the temperature history measured by 
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q(t) 




o 

sensor 






Fig. 5.10. Schematic diagram of the IHC problem. 


the sensor in the time interval. The schematic of this problem is shown in Fig. 5.10 
A similar problem without model uncertainty has been studied in |22j . 

In the simulation, we let L = 1, T = 2, c{u) = + 1, Xs = 0.9, and the initial 

condition be Uo(x) = 0. The temperature is measured 50 times (equally spaced) and 
the error in each measurement is assumed to be an independent zero-mean Gaussian 
random variable with variance 0.1^. We assume the prior on q{t) is a stationary 
zero-mean Gaussian process with a squared exponential covariance function: 


If — I ^ 

C(t,tO=exp(-^^), 


(5.5) 


where d = 0.3. The ’’truth flux” q(t) is a realization of the prior (shown in Fig. 5.121 


and the data is simulated with the generated flux q{t) and the boundary condi¬ 
tion (5.4b). In this problem the likelihood function becomes: 


-— = 0.8exp(-<I>i(u)) -I- 0.2exp(-<i)2(M)), 
dfj.0 


where $i(u) corresponds to Eq. (5.3) with BC (5.4a) and $ 2 (u) corresponds to 
Eq. ^ with BC dSAbl ). 

We draw samples from the posterior of u{t) with the four MCMC schemes used 
in the previous examples. In each MCMC scheme, 1.5 x 10® draws are generated. In 
both of the adaptive IS methods, the parameters are updated after every 500 draws, 
and the adaptation is terminated after 10® draws. To accelerate the convergence, 
we use tempering in the first 11 iterations (5,500 draws) with tempering parameter 
A = (i — 1)/10 for i = 1...11 . In the RW-pCN, we choose /3 = 0.1. 

We first show the trace plot of the OMF in Fig. (5.11), and it is quite clear that 
the results of the two adaptive methods are better than those of the prior based IS 
and the pCN. Because of the multimodality of the likelihood function, the posterior 
may have multiple modes, and to verify this, we apply the K-means method described 
in Section 4 to cluster the samples drawn by the four methods. The samples of the 
adaptive IS with mixtures can be successfully classified into two groups and we plotted 
the mean of each group in Fig. 5.12 (left), compared against the true heat flux. The 
K-means method, however, fails to separate the samples drawn by the other three 
methods, likely because the chains have not reached the target posterior distribution 
yet. We plot the means of the samples of the three methods in Fig. 5.12 (right). Like 
the previous examples we show the ACF results of the four methods in Figs. |5.13l and 
the acceptance rates and the ESS in Figs |5.14[ In all the plots, the adaptive IS with 
mixtures exhibits the best performance, followed by the IS with a single Gaussian. 


6. Conclusions. In conclusion, we have presented an adaptive IS algorithm for 
infinite dimensional Bayesian inference. Namely we choose a Gaussian mixture with 
a particular parametrization as our proposal, and adaptively adjust the parameter 
values using sample history. We prove that the proposed algorithm is well-defined 
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Prior Mixture 



Fig. 5.11. (for example 3) The trace plots of the OMF for the four different MCMC schemes. 




Fig. 5.12. (for example 3) Left: the means of the samples in each cluster of the chain drawn 
by the IS with mixtures and the true flux. Right: the means of the samples drawn by the adaptive 
IS with a single Gaussian, the prior based IS and the RW-pCN. 


in function space and thus is dimension independent. We also develop an efficient 
algorithm based on clustering to compute the parameter values in each iteration. We 
demonstrate the efficiency of the proposed method with numerical examples and in 
particular we show that it performs well for multimodal posteriors. We emphasize 
that, the proposed method is easy to implement, treating the problem as a black box 
model, and requiring no information on the mathematical structure of the forward 
model. 

As has been demonstrated by the numerical examples, the mixture proposals 
can generally provide faster mixing rates than the single Gaussian, thanks to its 
higher flexibility. On the other hand, given that the Gaussian approximation is less 
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Fig. 5.13. (for example 3) Autocorrelation functions (ACF) for the four different MCMC 
methods. Left: ACF of the OMF plotted as a function of lags. Right: the lag 1 ACF for u at each 
grid point. 



iterations (x 10^) 



Fig. 5.14. (for example 3) Left: the acceptance rate of the four MCMC schemes. Right: the 
ESS at each grid point. 


complex computationally (without the clustering step), we recommend to use the 
single Gaussian approximation in problems where the posterior distributions do not 
deviate too much from a Gaussian measure, and to use mixtures for strongly non- 
Gaussian posteriors. 

There are number of possible extensions of the work. First in this work we 
approximate the solution to the KLD minimization problem with clustering. It is 
possible that, if we can modify the standard EM algorithm and use it to solve the 
optimization problem directly, we may obtain better a mixture proposal in each iter¬ 
ation and improve the sampling efficiency. Secondly, the intrinsic dimensionality K 
is of essential importance for our method, and in the present work, K is determined 
rather heuristically. Thus developments of more effective and theoretically justified 
methods certainly deserve further studies. Finally, the algorithm developed here is 
based on independence sampler, and we are also interested in extending the ideas to 
the development of adaptive infinite dimensional random walk algorithms. We plan 
to investigate these problems in the future. 
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